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Abstract. Algorithms are developed for calculating dcaliased linear convolution sums without 
the expense of conventional zero-padding or phase-shift techniques. For one-dimensional in-placc 
convolutions, the memory requirements are identical with the zero-padding technique, with the im- 
portant distinction that the additional work memory need not be contiguous with the input data. 
This decoupling of data and work arrays dramatically reduces the memory and computation time re- 
quired to evaluate higher-dimensional in-place convolutions. The technique also allows one to dealias 
the higher-order convolutions that arise from Fourier transforming cubic and higher powers. Implic- 
itly dealiased convolutions can be built on top of state-of-the-art fast Fourier transform libraries: 
vectorized multidimensional implementations for the complex and centered Hermitian (pseudospec- 
tral) cases have been implemented in the open-source software FFTW++. 
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1. Introduction. Discrete linear convolution sums based on the fast Fourier 
transform (FFT) algorithm [Till] have become important tools for image filtering, dig- 
ital signal processing, and correlation analysis. They are also widely used in periodic 
domains to solve nonlinear partial differential equations, such as the Navicr-Stokcs 
equations. In some of these applications, such as direct numerical pseudospectral 
simulations of turbulent fluids, memory usage is a critical limiting factor, and self- 
sorting in-placc multidimensional Fourier transforms |14j are typically used to reduce 
the memory footprint of the required spectral convolutions. 

It is important to remove aliases from FFT-based convolutions applied to non- 
periodic (wavenumber-space) data because they assume cyclic input and produce 
cyclic output. Typically the input data arrays are extended by padding them with 
enough zeros so that wave beats of the positive frequencies cannot wrap around and 
contaminate the negative frequencies. A cyclic convolution ^2^ = q f P gk-p is then 
performed using a correspondingly larger Fourier transform size N. If the cost of 
computing a complex Fourier transform of size N is asymptotic to K N log 2 N as 
TV — > oo (the lowest bound currently achievable is K = 34/9 [H1[T0]), the asymptotic 
cost of computing the convolution of two vectors of unpadded length m is 6Km log 2 m 
(using three Fourier transforms with N = 2m). 

Another important case in practice is the centered Hermitian ID convolution, 
dealiased by the 2/3 padding rule [11]. Since the computational cost of complex-to-real 
and real-to-complex Fourier transforms of size N = 3m is asymptotic to ^KN log 2 -^i 
the FFT-based Hermitian convolution X^Lfc— m +i fp9k-p requires three transforms 
and hence |ifmlog 2 m operations. Alternatively, phase shift dealiasing [T2J[3] can be 
used to cancel out the aliasing errors between two convolutions with different phase 
shifts. However, this second technique is rarely used in practice since, in addition to 
doubling the memory requirements, it is computationally more expensive, requiring 
6Km log 2 m operations for a centered Hermitian convolution. 
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An explicit application of the zero-padding technique involves the rather obvious 
inefficiency of summing over a large number of data values that are known a priori 
to be zero. However, it is worthwhile to consider the response provided by Steven 
G. Johnson to a frequently asked question about the possibility of avoiding this ex- 
pense [5]: 

The most common case where people seem to want a pruned FFT 
is for zero-padded convolutions, where roughly 50% of your inputs are 
zero (to get a linear convolution from an FFT-based cyclic convolu- 
tion). Here, a pruned FFT is hardly worth thinking about, at least 
in one dimension. In higher dimensions, matters change (e.g. for 
a 3d zero-padded array about 1/8 of your inputs are non-zero, and 
one can fairly easily save a factor of two or so simply by skipping Id 
sub-transforms that are zero). 
The reasoning behind the assertion that such one-dimensional pruned FFTs are not 
worth thinking about is that if only m of the N inputs are nonzero, the computa- 
tional cost is reduced only from Alog 2 N to N\og 2 m. For example, if m = N/2, the 
savings is a minuscule l/log 2 N. Moreover, since a zero-padded Fourier transform of 
size N yields N (typically nonzero) output values, no storage savings appear possi- 
ble in one dimension. Nevertheless, in this work we demonstrate that pruning the 
zero-padded elements of one-dimensional convolutions is still worth thinking about, 
primarily because this provides useful building blocks for constructing more efficient 
multidimensional convolutions. 

The key observation is this: although the memory usage of our implicitly padded 
ID convolution is identical to that of a conventional explicitly padded convolution, 
the additional temporary memory need not be contiguous with the user data. In a 
multidimensional context, this external work buffer can be reused for other low dimen- 
sional convolutions. As a result, in d dimensions an implicitly dealiased convolution 
asymptotically uses l/2 d_1 of the storage space needed for an explicitly padded con- 
volution. When the Fourier origin is centered in the domain, memory usage is reduced 
to (2/3) d_1 of the conventional amount. If saving memory alone were the goal, this 
reduction could also be achieved with explicit zero padding by copying the data for 
the innermost convolution to an external padded buffer, but such extra data com- 
munication would degrade overall performance. The fact that our one-dimensional 
convolution does not require this extra copying is the main feature that was exploited 
to obtain simultaneous improvements in memory usage and speed. 

Nevertheless, the task of writing an efficient implicitly dealiased one-dimensional 
convolution is onerous, particularly if one tries to compete with a problem-dependent, 
architecture-adaptive FFT algorithm (like that provided by the award- winning FFTW 
[6] library, which empirically predetermines a near optimal butterfly graph at each 
subdivision). Effectively one wants to perform the outer FFT subdivision manually, 
dropping the zero terms and deferring the computation of the inner transforms to a 
standard library routine. But this also restricts the set of available platform-dependent 
algorithms that can be used at the highest level. Fortunately, several notable features 
of our algorithm help to offset this disadvantage. First, if the goal is to produce a 
convolution, bit reversal for the hand-optimized outermost subdivision is unnecessary: 
the scrambled Fourier subtransforms of the two input vectors can be multiplied to- 
gether as they are produced (perhaps while still accessible in the cache). Second, the 
implicit method allows most of the subtransforms for an in-place convolution to be 
optionally computed as out-of-place transforms, which typically execute faster than 
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in-place transforms (cf. Figs. 1-6 of [5]) since they require no extra (pre-, post-, or 
interlaced) bit-reversal stage. These savings help keep our one-dimensional in-place 
implicit convolution competitive with an explicitly padded convolution based on the 
same highly optimized library. 

In Section [21 we develop an algorithm for Fourier transforming a one-dimensional 
zero-padded vector without the need for explicit padding. We show how this algo- 
rithm can be used to calculate implicitly padded convolutions for both general and 
Hcrmitian inputs. We describe how implicit padding may be applied to compute the 
discrete Fourier transform of an input vector padded beyond an arbitrary fraction p/q 
of its length. Building on these one-dimensional algorithms, implicitly padded convo- 
lutions are implemented for two- and three-dimensional input in Section [3] Finally, 
in Section 21 we show for both general and Hermitian data how implicit padding may 
be used to dealias higher-order convolutions efficiently 

2. Implicitly dealiased ID convolutions. In this section we describe the 
optimized ID building blocks that are used in subsequent sections to construct higher- 
dimensional implicitly dealiased convolutions. 

2.1. Complex convolution. The Fourier origin for standard convolutions is 
located at the first (zero) index of the array. Therefore, input data vectors of length to 
must be padded with zeros to length N > 2m — 1 to prevent modes with wavenumber 
m — 1 from beating together to contaminate mode N = OmodiV. However, since FFT 
sizes with small prime factors in practice yield the most efficient implementations, it 
is normally desirable to extend the padding to TV = 2m. 

In terms of the iVth primitive root of unity, (jv = exp (2m /N) (here = is used to 
emphasize a definition), the unnormalized backward discrete Fourier transform of a 
complex vector {U k ■ k = 0, . . . , N} may be written as 

JV-l 

». ' X! s v ' •'• ■ j = 0,...,2V-l. 

fc=0 

The fast Fourier transform method exploits the properties that £]v = Cn/t an d Cn = 1- 
On taking N = 2m with U k = for k > to, one can easily avoid looping over the 
unphysical zero Fourier modes by decimating in wavenumber: for £ = 0, 1, . . . ,m — 1: 

m — 1 m — 1 m — 1 m — 1 

(2-1) uu = £ (™U k = £ CU k , u 2e+1 = £ dT^Uk = E &&nUu. 

k=0 k=0 k=0 k=0 

This requires computing two subtransforms, each of size m, for an overall computa- 
tional scaling of order 2m log 2 m ~ N log 2 to. 

The odd and even terms of the convolution can then be computed separately 
(without the need for a bit reversal stage), multiplied term by term, and finally 
transformed again to Fourier space using the (scaled) forward transform 

2m — 1 m—1 m—1 

2mU k = £ Cguj = £ Gfuu + £ C 2 m ( " +1 W + i 
j=G e=o e=o 

m — 1 771 — 1 

(2-2) = £ C~ U U2t + C 2 m E Cm W «M+l, fc = 0, . . . , m - 1. 

i=a 1=0 
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The implicitly padded transforms described by Eqs. (|2.ip and (|2.2I) arc implemented 
as Procedure |f f tpadBackwardj and |f f tpadForward| These algorithms are combined 
in Function icconvi to compute a dealiased convolution of unpadded length m using two 
arrays of size m as input vectors instead of one array of size 2m. This seemingly trivial 
distinction is the key to the improved efficiency and reduced storage requirements of 
the higher-dimensional implicit convolutions described in Section [3] Moreover, in 
Function icconvi we see that implicit padding allows each of the six complex Fourier 
transforms of size m to be done out of place. In the listed pseudocode, an asterisk 
(*) denotes an element- by- element (vector) multiply. 

In principle, the stable trigonometric recursion described by Buneman [2], which 
requires two small precomputcd tables, each of size log 2 N, could be used to compute 
the required roots of unity (j^ that appear in Eqs. (|2.1[) and (|2.2[) P1 While Buneman's 
recursion has the same average accuracy as an FFT itself |13j , on modern hardware 
a factorization method that does not rely on successive table updates turns out to be 
more efficient [9] , at the expense of somewhat higher memory usage. We instead calcu- 
late the factors with a single complex multiply, using two short precomputcd tables 
H a = Cpf an d Lb = C/vi where k = as + b with s = [^m\ , a = 0, 1, . . . , \m/s] — 1, and 
b = 0, 1, . . . , s — 1. Since these one-dimensional tables require only 0(y/m) complex 
words of storage and our focus is on higher-dimensional convolutions anyway, we do 
not account for them in our storage estimates. 

Referring to the computation times shown in Fig. 12.11 we see that the implicit 
padding algorithm described by Eqs. (|2.1[) and (|2.2[) can thus be implemented to be 
competitive with explicitly padded convolutions. The error bars indicate the lower 
and upper one-sided standard deviations 



i=i V 2 i=i 

t i <T *i>T 

where T denotes the mean execution time of n samples. Both the FFTW-3.2.2 li- 
brary and the convolution layer we built on top of it were compiled with the In- 
tel C/C++ 11.0 20081105 compiler, using the optimization options -ansi-alias 
-malign-double -fp-model fast=2 on a 64-bit 3GHz Intel E5450 Xenon proces- 
sor with 6MB cache. Like the FFTW library, our algorithm was vectorized for this 
architecture with specialized single-instruction multiple-data (SIMD) code. 

To compare the normalized error for the two methods, we considered the in- 
put vectors = Fe lk and gu = Ge lk for k = 0, . . . , m — 1, with F = V% + i\fl 
and G = y/E + iy/Tl. The Fourier transforms of these vectors have nonzero com- 
ponents for all transform sizes. In Fig. 12.21 we compare the normalized L 2 error 

\Jj2™ = Q \hk — Hk\ 2 1 \Jj2T=o l^fc| 2 f° r each of the computed solutions hk relative to 

the exact solution Hy. = X)p=o fp9k-p — FG(k + l)e lk . 

2.2. Implicitly dealiased centered Fourier transform. A basic building 
block for constructing multidimensional centered convolutions is an implicitly dealiased 
centered Fourier transform, where the input data length is odd, say 2m — 1, with the 
Fourier origin at index m — 1. Here, one needs to pad to N > 3m — 2 to pre- 
vent modes with wavenumber m — 1 from beating together to contaminate the mode 




x We note that, in terms of the smallest positive number e satisfying 1 + e > 1 in a given machine 
representation, the singularity in Buneman's scheme can be removed by replacing secn/2 with 5/e, 
sin47r with — e/2, and sin27r and sin7r each with — e/10. 
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Input: vector f 

Output: vector f, vector u 

for k = to to — 1 do 

I u[k\ <- cLn% 

end 

f <- fft" 



L (0; 

fft- x (u); 



Procedure f f tpadBackward(f,u) stores 
the scrambled 2?n-padded backward 
Fourier transform values of a vector f of 
length to in f and an auxiliary vector u 
of length m. 



Input: vector f, vector u 

Output: vector f 

f <- fft(f); 

u <- fft(u); 

for k = to to — 1 do 

| f[*] <- f[*] + ££u[fc]; 
end 

return f/ (2m); 



Procedure f f tpadForward(f,u) returns 
the inverse of f f tpadBackward(f,u). 



Input: vector f, vector g 
Output: vector f 



fft-^f) 



fft" 



u <— u * v; 

for k = to to — 1 do 



m <- c 2 fc m f[fc] 



s[*] 



end 



fft-!(f); 



fft 



-i / 



v <s— v * f ; 

f «- fft(u); 
u <- f ft(v); 

for k = to m — 1 do 

| f[fc]^f[fc] + C 2m fc u[fc]; 

end 

return f/(2m); 



Function cconv(f,g,u,v) computes an 
in-placc implicitly dealiased convolution 
of two complex vectors f and g using 
two temporary vectors u and v, each of 
length to. 
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Fig. 2.1. Comparison of computation times 
for explicitly and implicitly dealiased complex in- FlG - 2 - 2 - formalized L error /or expftc- 

p/ace JD com^uiions o/ Zenfftt m. The storage M V and dealiased complex m-place ID 

requirements of the two algorithms are identical, convolutions of length m. 
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with wavenumber — m + 1. The ratio of the number of physical to total modes, 
(2m — l)/(3m — 2), is asymptotic to 2/3 for large m (TT] . 

For explicit padding, one usually chooses the padded vector length TV to be a 
power of 2, with to = \_(n + 2)/3j . However, for implicit padding, it is advantageous 
to choose to itself to be a power of 2 since the algorithm reduces to computing FFTs 
of length to. Moreover, it is convenient to pad implicitly slightly beyond 3to — 2, to 
TV = 3to, as this allow the use of a radix 3 subdivision at the outermost level, so that 
only two of the three subtransforms of length to need to be retained. 

Suppose then that Uk = for k > to. On decomposing j = (3£ + r) mod TV, where 
r€ {-1,0,1}, the substitution k' = to + k allows us to write the backward transform 
as 

m—1 m — 1 m — 1 m — 1 

U3t+r = ^ CmQm U k = ^ Cm din " Uk'-m + ^ Cm(sm U k = ^ Cm^k^r, 
fc=-m+l fc' = l k=0 fe=0 

(2.3) 
where 

OAS ^f U ° [ik = °' 

1 j k ' r ~ \Qi(Uk + Q r Uk- m ) ifi<fc<TO-i. 

The forward transform is then 

1 ro-1 

(2.5) 3mU k = E (nf^Zl+r , k = -m + 1, . . . , TO - 1. 

r=-l i=0 

The use of the remainder r = — 1 instead of r — 2 allows us to exploit the optimization 
Cim = dm m Eqs. (|2.4p and (|2.5p . The number of complex multiplies needed to 



^3m S3m 

evaluate Eq. (|2.4p for r = ±1 can be reduced by computing the intermediate complex 
quantities 

Ak = CL ( Re u k + (s 1 Re U k . m ) , 
B k = zCL (Im [T fc + C3" 1 Im C/ fe _ m ) , 

where C^ 1 = ( — ^, — 2 )' 80 ^^ a ^ ^ or ^ > ^' ^fc- 1 = ^fe + -^fc and Wk.-i — A k — B k . 
The resulting transforms, Procedures |f f tOpadBackward] and |f f tOpadForward) each 
have an operation count asymptotic to 37VTO,log 2 TO. We were able to implement 
strided multivector versions of these algorithms since they operate fully in place on 
their arguments, with no additional storage requirements. 

2.3. Centered Hermitian convolution. In this frequently encountered case 
(relevant to the pseudospectral method), each input vector is the Fourier transform 
of real-valued data; that is, it satisfies the Hermitian symmetry U-k = Uk- While 
the physical data represented is of length 2to — 1, centered about the Fourier origin, 
the redundant modes (corresponding to negative wavenumbers) are not included in 
the input vectors. The input vectors are thus of length to, with the Fourier origin at 
index 0. Just as in Section [2.21 the unsymmetrized physical data needs to be padded 
with at least to — 1 zeros. Hermitian symmetry then requires us to pad the to non- 
negative wavenumbers with at least c = L TO /2J zeros. The resulting 2/3 padding ratio 
(for even to) turns out to work particularly well for developing implicitly dealiased 
centered Hermitian convolutions. As in the centered case, we again choose the Fourier 
size to be N = 3to. 
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Input: vector f 

Output: vector f, vector u 

u[0] «- f[m-l]; 

for fc = 1 to m — 1 do 



Ref[m — 1 

Imf [to — 1 

f [to -! + £]«-, 



fc] H 
hfc] 



B: 



l 

2 ' 



v/5 



Ref[fc] 



|,-^)lmf[fc] 



f[0] <- f[fc]; 

f[fc] <-f[fc]+f[m-l + fc]; 
end 

f [0, . . . , to — 1] <- fft- x (f[0, ...,r 
u[to] <— f [to — 1]; 
f[m-l] «- u[0]; 

f[m - 1, . . . ,2m - 2] <- fft _1 (f[m- 1,. 
u [0 , . . . , to — 1] <- fft _1 (u[0,...,m- 1] 



i]); 



,2m -2]); 



Procedure f f tOpadBackward(f,u) stores the scrambled 3m-padded centered back- 
ward Fourier transform values of a vector f of length 2to — 1 in f and an auxiliary 
vector u of length m + 1. 



Input: vector f, vector u 
Output: vector f 

f [to - 1, ... , 2m - 2] «- f f t(f [to - 1, . . . , 2m - 2]); 
u[to] <-> f [to — 1]; 

f [0, . . . , to - 1] <- f f t(f [0, . . . , m - 1]); 
u[0,...,to- 1] <- fft(u[0,...,TO- 1]); 
u[m] «- f [0] + u[m] + u[0]; 
for fc = 1 to to — 1 do 

f[fc - 1] = f [fc] + (-i ^)c^[rn - 1 + fc] + (-|, -^)c| m u[fc]; 

f[m - 1 + fc] = f[fc] + ££f[m - 1 + fc] + C 3 fe m u[fc]; 
end 

f [to — 1] 4— u[m]; 
return f/ (3m); 

Procedure f f tOpadForward(f,u) returns the inverse of f f tOpadBackward(f,u). 



Given that Uk — for fc > to, the backward (complcx-to-rcal) transform appears 
as Eq. (|2.3j) , but now with 

f0(t , ^(U = 0, 

(Z - b) ^ ~ \ £* {U k + C3 r Urn-k) if 1 < fc < TO - 1. 

We note that Wk, r obeys the Hermitian symmetry Wk. r = w m —k. r , so that the Fourier 
transform J2T=o Cm w k,r in Eq. (|2.3j) will indeed be real valued. This allows us to build 
a backward implicitly dealiased centered Hermitian transform using three complcx- 
to-real Fourier transforms of the first c + 1 components of Wk.r (one for each r G 
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{ — 1,0, 1}). The forward transform is given by 

1 m— 1 

(2.7) 3mU k = C 3 ™ fe E C™*«3/+r, fc = 0, . . . , m - 1. 

r=-l £=o 

Since U3£+ r is real, a real-to-complex transform can be used to compute the first 
c+1 frequencies of Y^eLo 1 Cm*' " u 3£+ri the remaining to — c — 1 frequencies needed in 
Eq. (|2.7|) are then computed using Hcrmitian symmetry. 

Since there are two input vectors and one output vector, the complete convolution 
requires a total of nine Hermitian Fourier transforms of size to, for an overall computa- 
tional scaling of ^Km\og 2 to operations, in agreement with the leading-order scaling 
of an explicitly padded centered Hermitian convolution. For simplicity, we document 
here only the practically important case to = 2c; minor changes are required to im- 
plement the case m = 2c + 1. We see in Function iconvi that seven out of the nine 
Fourier transforms can be performed out of place using the same amount of memory, 
2([N/2\ +1) = 6c + 2 words, as would be used to compute a centered Hermitian 
convolution with explicit padding. 

To facilitate an in-place implementation of the backward transform, we store the 
conjugate of the transformed values for r = 1 in reverse order in the upper half of the 
input vector, using the identity (for real Uj) 

c m — 1 

<o - E c jfc c4= E a (fc '- c) ^=(-i) j 'Ec4 fc ^ 

fc— — c+1 k' = m— 1 fc— 

obtained with the substitution fc' = c — fc. One can omit the factors of (— iy here since 
they will cancel during the real term-by-term multiplication of the two transformed 
input vectors. 

As seen in Fig. l2.3[ the efficiency of the resulting implicitly dcaliascd centered Hcr- 
mitian convolution is comparable to an explicit implementation. For each algorithm, 
we benchmark only those vector lengths that yield optimal performance. 

To check the accuracy of our implementation, we used the test case fk = Fe lk 
and gk = Ge lk for k = 0, ...,m — 1, with F = \/3 and G = v5; noting that 
Hcrmitian symmetry requires that F and G be real. The exact solution is Hk = 
FG T,™=k-m+i e ip e i( - k - p) = FG(2m - 1 - k)e lk . The normalized L 2 errors for im- 
plicit and explicit padding are compared in Fig. 12.41 

2.4. General padding. Implicit padding corresponding to an arbitrary p/q rule 
is also possible. Suppose that pm data modes are zero padded to TV = qm, where p 
and q are relatively prime. One decomposes j = qi + r, where I = 0, . . . , to — 1 and 
r = 0, . . . , q — 1. Similarly, one expresses k = tm + s, where t = 0, ... ,p — 1 and 
s = 0, . . . , to — 1: 

pm—l m—Xp—1 

Uql+r = e c^ +r)k Uk = EE^ 4m+s) c;£ m+s) t4 

fc=0 s=0 t=0 

Since there are q choices of r, the problem reduces to computing q Fourier transforms 
of length to, which requires Kqm\og 2 m = KN\og 2 (N/q) operations. Likewise, the 
forward implicit transform 

m— 1 q—l — l m— 1 

qmUk = E E C^ +r)fc <V+r = E E C'V+r, k = 0, . . . ,pm - 1 

i=0 r=0 r=0 i=0 



m—1 p— 1 

Ea^s V * fr(tm-\-s)rr 

s=0 t=0 
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Input: vector f, vector g 

Output: vector f 

F<-f[c]; 

build(f ,u); 

C <- f[c]; 

f[c] 2 Re F; 

u[c] <- ReF+V3ImF; 

G«-g[c]; 

build(g,v); 

D <- g[c]; 

g[c] <- 2ReG; 

v[c] <- Re G + v^ImG; 

u <— crfft(u); 
v <- crfft(v); 
v v * u; 
u <- rcfft(v); 

v<- crfft(f[0,...,c]); 
f[0,...,c] <- crfft(g[0,...,c]); 
v v * f[0, . . . , c]; 
f[0, . . . , c] 4- rcfft(v); 

S 4- f[c- 1]; 
T«-f[c]; 

f[c- 1] = RcF - v^ImF; 
f[c] = C; 

g[c-l] =ReG-v / 3ImG; 
g[c] = D; 

v 4- crfft(g[c- 1,. . .,2c- 1]); 

g[c - 1, . . . , 2c- 1] <- erf f t(f [c - 1, . . . , 2c - 1]); 

g[c - 1, . . . , 2c - 1] «- g[c - 1, . . . , 2c - 1] * v; 

v«-rcfft(g[c-l,...,2c-l]); 

for fe = 1 to c — 2 do 

f[fc] = f[A:]+Ce c \[A : ] + C 6 fc c u[fc]; 

f[2c- k] = f[k] + (-1, -^)&v[fc] + (-1 ^)Ca?u[*]; 
end 

f [c - 1] = 5 + CeV^tc - 1] + Clc" 1 "^ - 1]; 
f[c]=T-(-i,^)v[c]-(-i ) -^)u[c]; 
if c > 1 then 

f[c+i] = )c 6 T 1 ^ 3 iI+ (-I.^C^r^iI; 

end 

return f/ (6c); 



Function conv(f,g,u,v) uses Procedure build to compute an in-placc implicitly 
dealiased convolution of centered Hcrmitian vectors f and g of length 2c using tem- 
porary vectors u and v of length c + 1. 
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Input: vector f 

Output: vector f, vector u 

u[Q] <- f [Q]; 

F<-/[2c-l]; 

f[2c-l]^f[0]; 

for k = 1 to c — 1 do 



A«-C 



Ref[fc] 



1 \/3 

2 ' 2 



Re F 



B <- -<6c Imf [ fc ] 
f[fc] «- f [fc] + F; 
u[fc] <- A - B; 
F ^— f[2c — 1 - A:]; 
f [2c - 1 - k] <- A - 



end 



Procedure build(f,u) builds the FFT ar- 
rays required for Function iconvi from an 
unpadded vector f of length 2c into f and 
an auxiliary vector u of length c + 1 . 



bo 
o 



CD 




Fig. 2.3. Comparison of computation times 
for explicitly and implicitly dealiased centered 
Hermitian in-place ID convolutions of length 
2m - 1. 
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Fig. 2.4. Normalized L 2 error for explicitly and implicitly dealiased centered Hermitian in-pla 
ID convolutions of length m. 



also requires q Fourier transforms of length m. Again, the computational savings for 
a one-dimensional transform is marginal. 

3. Higher-dimensional convolutions. The algorithms developed in Section [2] 
can be used as building blocks to construct efficient implicitly padded higher-dimensional 
convolutions. 
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3.1. Complex 2D convolution. The implicitly padded 2D convolution shown 
in Function lcconv2l is designed for data stored with a stride of one in the y direction. 
Efficient multivector versions of Procedures |f f tOpadBackward| and |f f tOpadForward| 

are used for the transform in the x direction; this allows a single Cfm factor to be 
applied to a consecutive column of data at constant x. In principle, one could also 
develop a multivector version of Function icconvl to perform simultaneous convolutions 
in the y direction, but our timing tests indicate that this would only slightly enhance 
the overall performance (since the data for constant y is not stored consecutively in 
memory) and would prevent the ID convolution work arrays from being reused for 
the y convolution at each fixed x. The memory savings in our method comes precisely 
from this reuse of temporary storage, which in turn requires that the y convolutions 
be computed serially. 

As shown in Fig. 13.11 the resulting implicit 2D algorithm dramatically outperforms 
the explicit version: a 1024 2 implicit complex convolution is 1.91 times faster. 



CO 

a 



e - explicit 
■m- ■ ■ j/-pruned 
■ implicit 




Fig. 3.1. Comparison of computation times for explicitly and implicitly dealiased complex in- 
place 2D convolutions of size m? . 



The third sentence of the quote from Steven G. Johnson on page[T]suggests a sensi- 
ble optimization for explicitly padded 2D convolutions: one can omit for m < y < 2m 
the backward and forward Fourier transforms in the x direction. However potential 
data locality optimizations may be lost when a 2D convolution is expressed directly in 
terms of ID transforms: as observed in Fig. I3.1[ while a 1024 2 y-pruned explicit con- 
volution is 1.26 times faster than a conventional explicit implementation, the pruned 
method becomes 1.80 times slower for the 8192 2 case. Our implicitly dealiased con- 
volution is also subject to these same optimization losses, but the savings due to 
implicit padding, out-of-place subtransforms, the neglect of high-level bit reversal, 
and the immediate convolution of constant x columns (while still possibly in cache) 
outweigh these losses. 

Because the same temporary arrays u and v are used for each column of the 
convolution, the memory requirement is Am x m y + 2m y complex words, far less than 
the 8m x m y complex words needed for an explicitly padded convolution. 
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3.2. Centered Hermitian 2D convolution. In two dimensions, the Fourier- 
centered Hcrmitian symmetry appears as U_k -£ = This symmetry is exploited 
in the centered Hcrmitian convolution algorithm described in Function lconv2l As 
shown in Fig. 13.21 implicit padding again yields a dramatic improvement in speed. 

When m y is even, the memory usage for an implicitly dealiased (2m x — 1) x (2m y — 
1) centered Hermitian convolution is 2(2m x — ^)m y + 2{m x + l)m y + 2(m y /2 + 1) = 
6m x m y + m y + 2 complex words, compared with a minimum of 2(3m x — 2)(3m y /2) = 
9m x m y — 6m y complex words required for an explicitly dealiased convolution. 



Input: matrix f, matrix g 

Output: matrix f 

for j = to m y — 1 do 

f ftpadBackward(f T [j], U T [j]); 

f f tpadBackward(g T [j], V T [j]); 
end 

for i = to m. x — 1 do 

cconv(f[«],g[i], u,v); 

cconv(U[i], V[i], u, v); 
end 

for j = to m y — 1 do 
| fftpadForward(f T [j],U T [j]); 
end 

return f ; 



Function cconv2 (f,g,u,v,U,V) returns an 
in-place implicitly dealiased convolution 
of m x x m y matrices f and g using tem- 
porary m x x m y matrices U and V and 
temporary vectors u and v of length m y . 



Input: matrix f, matrix g 

Output: matrix f 

for j = to m y — 1 do 

f ftOpadBackward(f T [j], U T [j]); 

f f tOpadBackward(g T [j] , V T [j] ) ; 
end 

for i = to 2m x — 2 do 

conv(f[i],g[i],u,v); 
end 

for % = to m x do 

conv(U[i],V[i], u,v); 
end 

for j = to m y — 1 do 

f ftOpadForward(f T [j], U T [j}); 
end 

return f ; 

Function conv2 (f,g,u,v,U,V) returns an 
in-place implicitly dealiased centered 
Hermitian convolution of (2m x — 1) x m y 
matrices f and g using temporary (m x + 
1) x m y matrices U and V and vectors u 
and v of length m y . 



3.3. 2D pseudospectral application. In our implementation, we allow the 
convolution inputs to be arrays of vectors, fi and gi (i = 1,...,M), interpreting 
in Functions icconvi iconvi and Itconvl the product / * g as the element-by-element 
dot product /; * <?.; . Convolving M input data blocks simultaneously like this 
enables, for example, the nonlinear term of the 2D incompressible Euler equation to 
be computed using five Fourier transforms (instead of three for each of the M = 2 
input pairs). Specifically, the advective term -ti'Vw = -(ixVV" 2 w)'Vw, which 
appears in Fourier space as 

EPxky p y k x 
i \k-p\ 2 WpWk ~ p > 

can be computed efficiently with the call c(mv2(ik x uj,ik y uj,ik y u!/k 2 ,—ik x u)/k 2 , u,v), 
where u and v are work arrays. 

3.4. Complex and centered Hermitian 3D convolutions. The decoupling 
of the 2D work arrays in Function I c c onv2l facilitates the construction of an efficient 3D 
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implicit complex convolution, as described in Function lcconv31 As shown in Fig. [ 
an implicit 256 3 convolution is 2.38 times faster than the explicit version, while an xz- 
pruned version is only 1.24 times faster. The memory usage of an implicitly padded 
3D m x x my x m z convolution is 4m x m y m z + 2m y m z + 2m z complex words, far less 
than the 16m x m y m z complex words required by an explicit convolution based on 
powcr-of-two transforms. 

A (2m x — 1) x (2m y — 1) x (2m! — 1) implicit centered Hcrmitian 3D convolution 
was also implemented in an analogous manner. It required 

6m x (2m y — l)m z +2(m y + l)m z +2(m z /2+l) = 12m x m y m z —6m x m z +2m y m z +m z +2 

complex words, in comparison with the usual requirement of 27m x m y m z complex 
words for explicit padding with power-of-two transforms. 

4. Implicitly dealiased ternary convolutions. In this section, we show that 
implicit padding is well suited to dealiasing the centered ternary convolution 

m— 1 rn — 1 m — 1 

y y f P 9qhrS P + q + r ,k, 

p=— m+l q— — m-\-l r— — m+l 

which, for example, is required to compute the time evolution of the Casimir invariant 
J u! 4 dx associated with the nonlinearity of two-dimensional incompressible flow ex- 
pressed in terms of the scalar vorticity uj. The basic building blocks for this problem 
are again the centered Fourier transform and Hcrmitian convolution. 

4.1. Implicit double-dealiased centered Fourier transform. Here the in- 
put data length is 2m — 1, with the Fourier origin at index m — 1, so one needs to pad 
to N > Am — 3 to prevent contamination due to wave beating. 
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Input: vector f 
Output: vector f, vector u 
u[0] = f[0]=0; 
for k = 1 to 2m — 1 do 
| u[k] <- -Klj[k]\ 
end 

f <- fft- x (f); 
ii 4— f ft _1 (i]V 
return f; 


Input: matrix f, matrix g 
Output: matrix f 
for j = to m,y — 1 do 
for k = to m z — 1 do 

f f tpadBackward(f T [k] [j] , U T [k] 
f f tpadBackward(g T [k] [j] , V T [k] [j] ) ; 
end 
end 

for i = to m x — 1 do 

cconv2(f[i],g[i], ui, vi, u 2 , v 2 ); 
cconv2(U[iJ, V[i\, ui, vi, u 2 , v 2 ); 
end 

for j = to m a — 1 do 

for k = to m 2 — 1 do 

fftpadForward(f [k][j],U [k][j]); 

end 
end 

return f; 


Procedure f f tOtpadBackward(f,u) 

stores the scrambled signed 4m- 
padded centered backward Fourier 
transform values of a vector f of 
length 2m in f and an auxiliary 
vector u of length 2m. 


Input: vector f, vector u 
Output: vector f 
f <- fft(f); 
u «- fft(u); 

for fc = 1 to 2m — 1 do 
| f[*]<-f[A] + i^u[*]; 
end 

return f/ (4m); 


Function cconv3(f,g) returns an in- 
place implicitly dcaliascd convolution of 
m x x m y x m z matrices f and g using tem- 
porary m x x m y x m z matrices U and V, 



m B x m z matrices u 2 and v 2 , and vectors ui Procedure f f tOtpadForward(f,u) 
and Vi of length m z . returns the inverse of Procedure 

f ftOtpadBackward(f,u). 



Implicit padding is most efficiently implemented by padding the input vector with 
a single zero complex word at the beginning, to yield a vector of length 2m, with the 
Fourier origin at index m. We choose m to be a power of 2 and N = Am, with 11^=0 
for k = —m and k > m. 

On decomposing j = 21 + r, where I = 0, . . . , 2m — 1 and r <G {0, 1}, we find on 
substituting k' — k + m that 

m— 1 2m — 1 2m — 1 

_ /-£k r rk tt _ Mk'-m)Mk'-m) T7 -,\C--r /-Ik /-rk TT 

U2t+r — 2_^i ^2mUm U k — ^ <, 2m Urn u k'-m — 1 2-/ ^2mUm U k-r, 

k=-m fc'=0 k=0 

(4.1) 

The forward transform is then given for k = — m + 1, . . . , m — 1 by 

1 2m-l 

AmU k =J2^m E C 2 ~„?^+r 
r=0 C=0 
1 2m -1 

r=0 £=0 
1 2m-l 

= E^ fc ' <r E (/HIV, fc' = l,...,2m-l. 

r=0 «=0 
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For a ternary convolution, the product of the three factors (—1) (one for each input 
vector) arising from Eq. (|4.1[) and the factor (— \) 1 in Eq. (|4.2[) cancels. Procedures 
|f f tOtpadBackward| and |f f tOtpadForward] each have an operation count asymptotic 
to AKm log 2 to. As they operate fully in place on their arguments, with no additional 
storage requirements, it is straightforward to implement stridcd multivector versions 
of these algorithms. 

4.2. Implicitly dealiased centered Hermitian ID ternary convolution. 

Let us now consider a centered Hermitian ternary convolution with N = 4m, where m 
is a power of 2. For explicit padding, one needs to pad the m non-negative wavenum- 
bers with m + 1 zeros, for a total vector length of 2m + 1. 

On decomposing j = 21 + r, where I = 0, . . . , 2m— 1 and r € {0, 1}, the backward 
transform is given by 

m— 1 

U 2 £+r = ^2 ^2mClt U k- 
k— — m 

If we set U m = 0, the real components U2e+ r can thus be computed by taking a 
complex-to-real transform of {QmUk : fc = 0, . . . , m}. 
The forward transform is 

1 2m- 1 

Amllk = ^2 Qm Gm U 2t+r, fc = -TO + 1, . . . , TO - 1. 

r=0 1=0 

The resulting implicitly padded centered Hermitian ternary convolution, Function 
Itconvl has an operation count of SKm log 2 to. Five of the eight required Fourier 
transforms can be done out of place. In Fig. 14.11 we show that this algorithm is 
competitive with explicit padding. Function It c onvl req uires 6(m + 1) complex words 
of storage, slightly more than the 3(2m + 1) = 6m + 3 complex words needed for 
explicit padding. 

Just as for convolutions, the performance and memory benefits of dealiasing 
higher-order convolutions via implicit padding manifest themselves only in higher 
dimensions. For example, in Fig. 14.21 we observe for m x = m y = 4096 that the 
implicit (2m x — 1) x m y centered Hermitian ternary convolution computed with 
Function ltconv2l is 2.28 times faster than an explicit version. The memory us- 
age for a (2m x — 1) x m y implicit centered Hermitian ternary convolution is 6 ■ 
2TO x (TOj / + l)-|-3(mj, + l) = l2m x m y + l2m x -\-3m y -\-3 complex words, compared with 
3 • 4m x (2mj, + 1) = 24ma;TOj, + 12m x complex words (using powcr-of-two transforms) 
for an explicit version. In contrast, the corresponding y-pruned convolution requires 
the same amount of storage as, but is 1.42 times faster than, an explicitly padded 
version. 

5. Concluding remarks. Explicitly padded Fourier transforms are frequently 
used to dealias convolutions in one or more dimensions. In this work we have de- 
veloped an efficient method for avoiding explicit zero padding in multidimensional 
convolutions, thereby saving both memory and computation time. The key idea that 
was exploited was the decoupling of temporary storage and user data, which in higher 
dimensions allows the reuse of storage space. The resulting increased data locality 
significantly enhanced performance by as much as a factor of 2. The savings in mem- 
ory use, obtained by computing the Fourier transformed data in blocks rather than 
all at once, was equally significant: asymptotically, as m x — ¥ oo, an implicit complex 
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Input: matrix f, matrix g, matrix h 
Output: matrix f 
for j = to m,y — 1 do 

f f tOtpadBackward(f T [j] , U T [j] ) ; 
f f tOtpadBackward(g T [j] , V T [j] ) ; 
f f t OtpadBackward(h T [j] , W T [j] ) ; 
end 

for i — to 2m x — 1 do 

tconv(f[i],g[i], h[i], u,v,w); 
tconv(U[i], V[i], W[z], u, v, w); 
end 

for j = to m y — 1 do 

fftOtpadForward(f T [j], U T [j]); 
end 

return f; 

Function tconv2(f,g,h) returns an 
in-placc implicitly dealiased centered 
Hcrmitian ternary convolution of 
2ma; x (m y + 1) matrices f, g, and h 
using temporary 2m x x (m y + 1) matri- 
Function tconv(f,g,h,u,v,w) computes ces U, V, and W and vectors u, v and w 
an in-placc implicitly dealiased cen- of length m y + 1. 
tered Hcrmitian ternary convolution of 
three centered Hcrmitian vectors f, g. 
and h, using three temporary vectors 
u, v, and w, each of length m + 1. 



convolution requires one-half of the memory needed for a zero-padded convolution 
in two dimensions and one-quarter in three dimensions. In the centered Hermitian 
case, the memory use in two dimensions is 2/3 of the amount used for an explicit 
convolution and 4/9 of the corresponding storage requirement in three dimensions. 

Even in one dimension, where implicit padding can be implemented competitively 
with conventional methods, the method has notable advantages. For the intended 
application to partial differential equations, there is flexibility in the choice of the 
exact convolution size. This is why we consider for each algorithm only those vector 
lengths that maximize performance. On the other hand, for those applications where 
the size of the convolution is dictated by external criteria, implicit padding effectively 
expands the available set of efficient convolution sizes to include integral powers of 2, 
a case of practical significance. Canute et al. j3] p. 136] point out that if only a power- 
of-two transform were available for a centered convolution, zero padding a vector of 
length m = 2 k would require a transform size of 2m, yielding an even slightly higher 
operation count, 6Km log 2 (2m), than the 6Km\og 2 m operations required for phase- 
shift dcaliasing. The availability of implicitly dealiased convolutions now makes this 



Input: vector f, vector g, vector h 
Output: vector f 
for k = to m — 1 do 

u[k] <- cUH 
Ak] <- cLeM; 
w[k] <- cLM% 

end 

u[m] = v[m] = w[m] = 0; 
u «- erf ft _1 (u); 
v <- erf ft _1 (v); 
w crfft _1 (w); 
v u * v * w; 
u <- rcfft(v); 

f[m] = g[m] = h[m] = 0; 
v <- crfft _1 (f); 
w <- crfft _1 (g); 
g^crfft-^h); 
v <s— v * w * g; 
f <- rcf f t(v); 

for k = to m — 1 do 

| f[/c]^f[fc] + C4^u[fc]; 
end 

return f/ (4m); 
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Fig. 4.1. Comparison of computation times 
for explicitly and implicitly dealiased centered 
Hermitian in-place ID ternary convolutions of 
length m. 



Fig. 4.2. Comparison of computation times 
for explicitly and implicitly dealiased centered 
Hermitian in-place 2D ternary convolutions of 
size (2m — 1) X m. 



argument moot. 

Another advantage of implicit padding is the ability of the algorithm to work 
directly on raw unpadded user data without the inconvenience or extra storage re- 
quirements of a separate padding buffer. Having a prewritten, well-tested dealiased 
convolution that takes care of dcaliasing internally is a major convenience for the 
average user. For 2D and 3D Hermitian convolutions, a prepackaged routine should 
also automatically enforce Hermitian symmetry of the data along the x axis or the xy 
plane, respectively. With the highly optimized implementations of the algorithms 
developed in this work made available in the open source package FFTW++ pQ , writing 
a pseudospectral code for solving nonlinear partial differential equations should now 
be a relatively straightforward exercise. 
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